#Introduction This analysis was done by Mike Goldweber, Sept 2023. Submitted to DrivenData.org on Oct 6, 2023. This document shows; the step by step process of analyzing the emergency room data provided in the Unsupervised Wisdom contest.

The full set of project files can be found at: https://github.com/halciber/Machine-Learning/tree/master/UnsupervisedWisdom.

Data Ingestion

The data used in this project is the Primary data set. This contains 115,128 rows of data. The data is pulled into two dataframes. One called df_original, and the other is called df_mapped. The df_mapped set is modified by the example code so that it produces human readable data. the df_original set will be modified by the feature engineer to be usable by the modeling done below.

#You'll want to adjust your file path for your environment
filepath <- "c:\\_working\\Machine-Learning\\UnsupervisedWisdom\\Data\\primary_data.csv"

df_original <- read.csv(filepath)

Setting up the categorical mapping dataframe

The code for the next two blocks was take directly from the Community Code section of the contest website, found at: https://www.drivendata.org/competitions/217/cdc-fall-narratives/community-code/.The purpose of this block is to take the JSON coding information and match it to the numeric values found in primary_data.csv. I’ve found this version of the dataset to be helpful in understanding the data. For example, “57 - FRACTURE” is more meaningful than the original “57”. My intention is to use the df_mapped dataframe primarily with the data exploration and the. df_original, created in the code block above will be used for the modeling work after modification in the Feature Engineer section below.

library(jsonlite)

mappingfile <- 'c:\\_working\\Machine-Learning\\UnsupervisedWisdom\\Data\\variable_mapping.json'
mapping <- fromJSON(mappingfile)
names(mapping)
 [1] "sex"              "race"             "hispanic"         "alcohol"          "drug"             "body_part"        "body_part_2"     
 [8] "diagnosis"        "diagnosis_2"      "disposition"      "location"         "fire_involvement" "product_1"        "product_2"       
[15] "product_3"       
# Convert to data frames so we can use in joins
mapping_tables <- list()
for (col in names(mapping)) {
    mapping_tables[[col]] <- data.frame(
        ind=as.integer(names(mapping[[col]])),  # change to integer types
        values=unlist(mapping[[col]])
    )
}

Now that the JSON information has been ingested, we’ll map the categories on to the dataframe, which we will call df_mapped.

library(dplyr)

# Load primary data
df_mapped <- df_primary

# Join and replace encoded column
for (col in names(mapping)) {
    df_mapped <- df_mapped %>%
        left_join(mapping_tables[[col]], by=setNames("ind", col)) %>%
        mutate(!!col := values) %>%
        select(-values)
}

Data Clean Up/ Preliminary Feature Engineer

Let’s look at the dataset to get an initial feel for the data quality by running a summary

summary(df_original)
 cpsc_case_number     narrative         treatment_date          age              sex             race         other_race           hispanic       diagnosis    
 Min.   :190103269   Length:115128      Length:115128      Min.   : 65.00   Min.   :1.000   Min.   :0.0000   Length:115128      Min.   :0.000   Min.   :42.00  
 1st Qu.:200255706   Class :character   Class :character   1st Qu.: 72.00   1st Qu.:1.000   1st Qu.:0.0000   Class :character   1st Qu.:0.000   1st Qu.:57.00  
 Median :210527801   Mode  :character   Mode  :character   Median : 79.00   Median :2.000   Median :1.0000   Mode  :character   Median :2.000   Median :57.00  
 Mean   :208188741                                         Mean   : 79.35   Mean   :1.631   Mean   :0.7957                      Mean   :1.259   Mean   :58.69  
 3rd Qu.:220460723                                         3rd Qu.: 86.00   3rd Qu.:2.000   3rd Qu.:1.0000                      3rd Qu.:2.000   3rd Qu.:62.00  
 Max.   :230222638                                         Max.   :112.00   Max.   :2.000   Max.   :6.0000                      Max.   :2.000   Max.   :74.00  
                                                                                                                                                               
 other_diagnosis     diagnosis_2    other_diagnosis_2    body_part      body_part_2     disposition       location     fire_involvement       alcohol       
 Length:115128      Min.   :41.0    Length:115128      Min.   : 0.00   Min.   : 0.00   Min.   :1.000   Min.   :0.000   Min.   :0.0000000   Min.   :0.00000  
 Class :character   1st Qu.:53.0    Class :character   1st Qu.:37.00   1st Qu.:35.00   1st Qu.:1.000   1st Qu.:1.000   1st Qu.:0.0000000   1st Qu.:0.00000  
 Mode  :character   Median :59.0    Mode  :character   Median :75.00   Median :75.00   Median :1.000   Median :1.000   Median :0.0000000   Median :0.00000  
                    Mean   :59.8                       Mean   :65.51   Mean   :63.98   Mean   :2.122   Mean   :1.715   Mean   :0.0005993   Mean   :0.02248  
                    3rd Qu.:63.0                       3rd Qu.:79.00   3rd Qu.:79.00   3rd Qu.:4.000   3rd Qu.:1.000   3rd Qu.:0.0000000   3rd Qu.:0.00000  
                    Max.   :74.0                       Max.   :94.00   Max.   :94.00   Max.   :6.000   Max.   :9.000   Max.   :3.0000000   Max.   :1.00000  
                    NA's   :71983                                      NA's   :71983                                                                        
      drug           product_1      product_2        product_3      
 Min.   :0.00000   Min.   : 110   Min.   :   0.0   Min.   :   0.00  
 1st Qu.:0.00000   1st Qu.:1715   1st Qu.:   0.0   1st Qu.:   0.00  
 Median :0.00000   Median :1807   Median :   0.0   Median :   0.00  
 Mean   :0.04035   Mean   :2168   Mean   : 504.1   Mean   :  56.16  
 3rd Qu.:0.00000   3rd Qu.:3299   3rd Qu.: 474.8   3rd Qu.:   0.00  
 Max.   :1.00000   Max.   :5043   Max.   :5040.0   Max.   :5040.00  
                                                                    

Usually a concern is missing data scattered randomly thoughtout the set. In this case, the summary shows the data is in good shape. That is there isn’t much in the way of missing data. The only NA’s we see are in the diagnosis_2 and body_part_2 columns. This isn’t a suprise, because the ER patients didn’t necessarily suffer secondary injuries. There is a correlation between these two columns, given the identical number of NAs at 71,983. These columns will have to be explored further to determine if it should be used in our modeling.

However, glancing at the data sets directly shows gaps in some of the other columns. For example body_part_2, other_diagnosis and other_diagnosis_2 contain many gaps.

head(df_mapped)

Data Exploration

This is actually the critical portion of the project, and most of the effort was spent on this work in order to understand the scope of the problem. The results of this exploration affected later portions of the exploration. Please note, the narrative column won’t be looked at in the data exploration. The goal is to use ChatGPT to analyze this data and return helpful results. So, this part of the data is being handled separately.

Let’s begin by looking at correlations between the columns.

Correlation Plot

library(corrplot)

#we need to use numerical values for the correlation. So, we'll make a subset of the data.
df_numsubset<- df_original[, c(4:6, 8:9, 13, 15:22)]

#visualization matrix of the data, looking for correlations
corrplot(cor(df_numsubset), type= "upper")

We see strong correlations between race and hispanic, product_1 and product_2, as well as product_2 and product_3. Honestly, this doesn’t seem to be very helpful. At least as this stage. Our focus is on the injuries. There are some connections to age and some of the other factors. Including alcohol. So, we’ll have to explore this.

Age

Let’s look at the age category. In particular, let’s see if there is a particular age that is hit harder than another age. The block converts it into a table, and the plot shows the frequency of injuries by age.

library(ggplot2)


#frequency of injuries by age
data <- df_numsubset[, c('age', 'sex')]
df_agefreq <- as.data.frame(table(data$age))
colnames(df_agefreq)[1] <- "age" 
colnames(df_agefreq)[2] <- "frequency" 
#head(df_agefreq)
ggplot(df_agefreq) + 
    geom_count(mapping = aes(x=age, y=frequency)) + 
    labs(title = "Frequency of Injury By Age", x="Age", y="Count")

What is interesting about this plot, it shows that the injury frequency is relatively high (over 3500) until age 88. I am wondering if the fall off is due to some environmental movement, or is the population over 88 shrinking? There is a bit of a plateau for ages 71-80, where each group has over 4000 injuries, except for age 76 with 3993 injuries.

Sex (Gender)

Next, let’s look at the breakdown of sex (gender) in this dataset.

#63% of the injuries are to women

genderlabels <- c("Male", "Female")
gender <- as.vector(df_original[ ,'sex'])
gentable <- as.data.frame(table(gender))

gentable$gender <- mapvalues(gentable$gender, c( "1", "2"), genderlabels)

pie(gentable$Freq, labels = genderlabels, main="Sex Breakdown")


percentoffemales <- (gentable[2, 'Freq']/nrow(df_original)*100)
out <- sprintf("Percentage of females in this dataset: %f)", percentoffemales) #percentage of females in this dataset
out
[1] "Percentage of females in this dataset: 63.115836)"

This plot visually shows the breakdown of sex in this dataset. We see that at 63.1%, females represents the majority of cases in this dataset. So, we’ll pay particular attention to the factors affecting this part of the population.

Age and Gender

Next, let’s look at the split by age and gender.

library(ggplot2)
library(sqldf)
library(reshape2)

results <- sqldf('SELECT age, sex, count(sex) AS "frequency"
                        FROM dfmapping
                        GROUP BY age, sex')

# reshape the dataframe into a long format
results <- melt (results, id.vars = c ("sex", "age"))

# plot two lines for different genders
ggplot (results, aes (x = age, y = value, group = sex, color = sex)) +
  geom_line () +
  labs (x = "Age", y = "Injury Frequency", color = "Sex")

This chart shows a significant difference by gender across the ages of the patients. First, this plot confirms the previous pie chart plot by showing the female population suffers greater numbers of injuries across the age spectrum. As we explore further, we’ll have to consider the possibility of using different approaches for helping each gender. Of note, this shows the data set only includes males and females. The other categories are not represented in this dataset.

Data Exploration by Race and Sex

Next, let’s factor in the race of the patients to see if any particular group is affected more than the others. After looking at the upper level data by race, we’ll look at the hispanic breakdown at the population.

Breakdown By Race

We’ll start off with a simple pie chart to see.


if (system.file(package = "waffle") == "") {
  install.packages("waffle")
}

library(ggplot2)
library(waffle)

rcounts <- as.data.frame(table(df_mapped$race))
colnames(rcounts)[1]<-"race"
colnames(rcounts)[2]<-"frequency"

vec <- numeric()
vecS <- character()

for(i in rcounts$frequency)
{
    x <- i/rowcount
    x <- x*100
    vec <- c(vec, x)
    
    s <- ifelse((x > 7.0), sprintf("%.2f%%", x), "")#display only meaning values on the pie chart
    vecS <- c(vecS, s)
}

rcounts <- cbind(rcounts, new_col = vec)
colnames(rcounts)[3]<-"percent"

rcounts <- cbind(rcounts, new_col = vecS)
colnames(rcounts)[4]<-"labels"

ggplot(rcounts, aes(x = "", y = percent, fill = race)) +
  geom_bar(stat="identity", width=1) +
  geom_text(aes(label = labels), position = position_stack(vjust = 0.5)) +
  scale_fill_manual(values = c("#E59866", "#FCF3CF", "#AF601A", "#AED6F1", "#BD0026", "#27AE60", "#FAD7AD")) + ##E59866
  coord_polar("y", start=0) + 
  labs(title = "Breakdown of the Patient Population by Race",
          #subtitle = "test",
          caption = "34.4% did not state their race, making a racial correlation difficult")

This chart is problematic, because of the high percentage of N.S. (not stated). So it maybe difficult to accurately identify a racial component to these injuries.

Hispanic

if (system.file(package = "waffle") == "") {
  install.packages("waffle")
}

library(ggplot2)
library(waffle)

hcounts <- as.data.frame(table(df_mapped$hispanic))
colnames(hcounts)[1]<-"hispanic"
colnames(hcounts)[2]<-"frequency"

vec <- numeric()
vecS <- character()

for(i in hcounts$frequency)
{
    x <- i/rowcount
    x <- x*100
    vec <- c(vec, x)
    
    s <- sprintf("%.2f%%", x)
    vecS <- c(vecS, s)
}

hcounts <- cbind(hcounts, new_col = vec)
colnames(hcounts)[3]<-"percent"

hcounts <- cbind(hcounts, new_col = vecS)
colnames(hcounts)[4]<-"labels"

ggplot(hcounts, aes(x = "", y = percent, fill = hispanic)) +
  geom_bar(stat="identity", width=3) +
  geom_text(aes(label = labels), position = position_stack(vjust = 0.5)) +
  scale_fill_manual(values = c("#E59866","#27AE60", "#FCF3CF")) +   
  coord_polar("y", start=0) + 
  labs(title = "Breakdown of the Hispanic Population",
          #subtitle = "test",
          caption = "Only 3% of the patients identify as Hispanic")

So, the majority of patients are not hispanic, however, we can’t conclude not being hispanic makes a person more likey to fall.

rhcounts <- as.data.frame(table(df_mapped$hispanic, df_mapped$race))
colnames(rhcounts)[1]<-"hispanic"
colnames(rhcounts)[2]<-"race"
colnames(rhcounts)[3]<-"frequency"

webr::PieDonut(rhcounts, aes(hispanic, race, count=frequency), 
              r0 = 0.3, #hides the center
               r1 = 0.7,
               #explode = 3,
               #explodeDonut = TRUE,
               pieLabelSize = 3,
               donutLabelSize = 4,
               title = "Hispanic and Racial Breakdown")
Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as of ggplot2 3.3.4.

As with the Hispanic pie chart, and with only 3% of the injured set being hispanic, I’m not convinced that hispanic=yes correlates to a fall. Further, no one racial group stands out Certainly, part of the Unk/Not Stated group is probably hispanic, but we don’t know for certain. Further obfuscating this part of the hispanic set is the N.S part of the race data, which represents 76% of the Unknown hispanic group.

We’ll continue to look at race, but I don’t believe there is anything to be gained from including the hispanic column in our model.

rm(rhcounts)
Warning: object 'rhcounts' not found

Diagnosis

Next is a look at the diagnosis column. We’ll start off by looking for the most heavily diagnosed injuries.

#grouping the diagnosis column, looking for the most frequent injury

##frequency of each diagnosis
dcounts <-df_mapped[, c('diagnosis')]

df_diagnosisfreq <- as.data.frame(table(dcounts))
colnames(df_diagnosisfreq)[1] <- "diagnosis_code" 
colnames(df_diagnosisfreq)[2] <- "frequency" 

#sort the data in descending order
df_diagnosisfreqSorted <- df_diagnosisfreq[order(-df_diagnosisfreq$frequency),]
head(df_diagnosisfreqSorted)

#cleanup
rm(dcounts, df_diagnosisfreq)

From this sorting, we know that fractures, internal injuries, contusions abj, and lacertations make up the majority of the 26 listed injuries by a significant amount.

Seeing the majority (86%) of the dataset is diagnosed with these 4 injuries (fractures, internal injuries, contusions abj, and lacertations), our focus will be on how age, race and location play a part with these injuries.

#cleanup
rm(vec, vecS, vecColor, df_diagnosisfreqSorted)

Fire Involvement and Alcohol

Related to the injuries are the cases of burns and alcohol having an effect on the falling injuries. Let’s take a look to see how the patient population in this dataset has been affected by them.We know from the diagnosis frequencies, that burn injuries

Fire Involvement

##frequency of the burn diagnosis
burncounts <- sqldf('SELECT diagnosis, count(diagnosis) AS "frequency"
                        FROM dfmapping
                        WHERE (diagnosis = "47 - BURN, NOT SPEC." or diagnosis = "49 - BURN, CHEMICAL" or diagnosis = "48 - BURN, SCALD" or diagnosis = "51 - BURNS, THERMAL")
                        GROUP BY diagnosis
                        ORDER BY frequency DESC')

head(burncounts)

##frequency of fire_involvement
fcounts <-df_mapped[, c('fire_involvement')]

df_firefreq <- as.data.frame(table(fcounts))
colnames(df_firefreq)[1] <- "fire_involvement" 
colnames(df_firefreq)[2] <- "frequency" 

#sort the data in descending order
df_FireIsFreqSorted <- df_firefreq[order(-df_firefreq$frequency),]
head(df_FireIsFreqSorted)

#cleanup
rm(burncounts, fcounts, df_firefreq, df_FireIsFreqSorted)

From these numbers, we see there is very few burn injuries. Eightyone total, or 0.7% of the total dataset. For this project, we won’t explore burn injuries or fire involvement further because this is such a small part of the population.

Alcohol

Unlike the burns and fire involvement, there isn’t a set of injuries that specifically connects with alcohol usage. So, if we see a large number of patients that reported alcohol usage, it will be interesting to see which diagnosis were tied to it.

##frequency of the burn diagnosis
alcoholcounts <- sqldf('SELECT alcohol, count(alcohol) AS "frequency"
                        FROM dfmapping
                        GROUP BY alcohol
                        ORDER BY frequency DESC')

head(alcoholcounts)

yestotal <- alcoholcounts[2,2]

percent <- yestotal/rowcount*100
print(sprintf("Percentage of cases where alcohol was related to the falling injury: %.2f%%.", percent))
[1] "Percentage of cases where alcohol was related to the falling injury: 2.25%."
#cleanup
rm(alcoholcounts, yestotal, percent)

The number of alcohol related cases is 2588, or 2.25% of the patients in the dataset. For this project, we won’t explore alcohol further because this represents a small part of the population.

Locations Where Patients Reported Getting Hurt

I want to take a quick look at where people are getting hurt. This could impact how we look for correlations later on.


locationresults <- sqldf('SELECT location, count(location) AS "frequency"
                        FROM dfmapping
                        WHERE (diagnosis = "53 - CONTUSIONS, ABR." or diagnosis = "57 - FRACTURE" or diagnosis = "59 - LACERATION" or diagnosis = "62 - INTERNAL INJURY")
                        GROUP BY location')

pie(locationresults$frequency, labels = locationresults$location, main="Location Breakdown")


#percentoffemales <- (gentable[2, 'Freq']/rowcount*100)
#ut <- sprintf("Percentage of females in this dataset: %f)", percentoffemales) #percentage of females in this dataset
#out

This pie chart isn’t the prettiest, but it does clearly show that Home is the primary location where injuries take place. This is followed up by Public and Unknown. Unknown is a frustrating result for this portion of the data, because it isn’t clear how to mitigate problems at this location.

Diagnosis, Race, and Gender

Let’s look to see if there is any correlation between the injuries, race, and gender, with the focus being on the top 4 injuries identified above.

From this plot, we can see that women are severely diagnosed with “57 - Fractures” and “63 - Internal Injury”; however, women suffer from all of these top diagnosis more than men. So special attention has to be paid for addressing the factors affecting women.

Diagnosis, Sex (Gender), and Location

So far, we’ve seen individual breakdowns of the data, but let’s start mixing up the columns in the hopes of exposing a culprete to the problem.

locationresults <- sqldf('SELECT location, diagnosis, sex, count(sex) AS "frequency"
                        FROM dfmapping
                        WHERE (diagnosis = "53 - CONTUSIONS, ABR." or diagnosis = "57 - FRACTURE" or diagnosis = "59 - LACERATION" or diagnosis = "62 - INTERNAL INJURY")
                        GROUP BY location, diagnosis, sex')
sum(locationresults$frequency) #results = 99868
[1] 99868
    #<-    df_firefreq[order(-df_firefreq$frequency),]
sorted <- locationresults[order(-locationresults$frequency),]

head(sorted)

rm(locationresults, sorted)

This result is interesting because we see that fractures and internal injuries affect women by a significant amount at home.

Location, Diagnosis, Body Part

Now that we’ve identified females are being disproportional hurt at home, which body_part is affected the most?

I’m focused on the women, however, let’s continue to look at men in the query.

locationresults <- sqldf('SELECT location, diagnosis, body_part, sex, count(sex) AS "frequency"
                        FROM dfmapping
                        WHERE (diagnosis = "53 - CONTUSIONS, ABR." or diagnosis = "57 - FRACTURE" or diagnosis = "59 - LACERATION" or diagnosis = "62 - INTERNAL INJURY")
                        GROUP BY location, diagnosis, body_part,  sex')
sum(locationresults$frequency) #results = 99868
[1] 99868
    #<-    df_firefreq[order(-df_firefreq$frequency),]
sorted <- locationresults[order(-locationresults$frequency),]

head(sorted)
NA

These results are helpful. Previously we saw fractures at the most serious problem. However, when we factor in the body_part we see the internal injuries to the head at home and in the public location is the most serious problem

rm(locationresults, sorted)

Let’s look at this same data with just the women.


locationresults <- sqldf('SELECT location, diagnosis, body_part, sex, count(sex) AS "frequency"
                        FROM dfmapping
                        WHERE (diagnosis = "53 - CONTUSIONS, ABR." or diagnosis = "57 - FRACTURE" or diagnosis = "59 - LACERATION" or diagnosis = "62 - INTERNAL INJURY") AND
                                (sex = "FEMALE")
                        GROUP BY location, diagnosis, body_part,  sex')
sum(locationresults$frequency) #results = 99868
[1] 63248
    #<-    df_firefreq[order(-df_firefreq$frequency),]
sorted <- locationresults[order(-locationresults$frequency),]

head(sorted)
NA

From this table, we see that when the diagnosis is an internal injury, the head is the body part injuried the most. Home is the most likely place for the injuries, followed by the public or unknown locations. Next most frequent injury is the upper or lower trunk experiencing fractures.

Let do a similar query for the males to see how they are affected.


locationresults <- sqldf('SELECT location, diagnosis, body_part, sex, count(sex) AS "frequency"
                        FROM dfmapping
                        WHERE (diagnosis = "53 - CONTUSIONS, ABR." or diagnosis = "57 - FRACTURE" or diagnosis = "59 - LACERATION" or diagnosis = "62 - INTERNAL INJURY") AND
                                (sex = "MALE")
                        GROUP BY location, diagnosis, body_part,  sex')
sum(locationresults$frequency) #results = 99868
[1] 36620
    #<-    df_firefreq[order(-df_firefreq$frequency),]
sorted <- locationresults[order(-locationresults$frequency),]

head(sorted)
NA

The results for men are very similar in that internal injuries to the head are hit the male population with the greatest frequency. Also, like the females, we see that males experience the fractures to the lower and upper trunk most frequently (after the head injuries), and the major of these injuries are at home.

Connections to Products?

Next, let’s see if there is a tie to the products.


locationresults <- sqldf('SELECT location, diagnosis, body_part, product_1,  sex, count(sex) AS "frequency"
                        FROM dfmapping
                        WHERE (diagnosis = "53 - CONTUSIONS, ABR." or diagnosis = "57 - FRACTURE" or diagnosis = "59 - LACERATION" or diagnosis = "62 - INTERNAL INJURY")
                        GROUP BY location, diagnosis, body_part, product_1,  sex')
#sum(locationresults$frequency) #results = 99868
    #<-    df_firefreq[order(-df_firefreq$frequency),]
sorted <- locationresults[order(-locationresults$frequency),]

head(sorted)
NA

This query shows the majority of products are tied to floors or flooring materials. Does this mean the patient just fell on the floor leading to the internal injuries and fractures? Also of note home, internal injury, and heads are the most likely to be the combination for our patients. Let’s look at the same data with just the floors.


locationresults <- sqldf('SELECT location, diagnosis, body_part, product_1,  count(product_1) AS "frequency"
                        FROM df_mapped
                        WHERE (diagnosis = "53 - CONTUSIONS, ABR." or diagnosis = "57 - FRACTURE" or diagnosis = "59 - LACERATION" or diagnosis = "62 - INTERNAL INJURY")
                                AND (product_1 = "1807 - FLOORS OR FLOORING MATERIALS")
                        GROUP BY location, diagnosis, body_part, product_1')
sorted <- locationresults[order(-locationresults$frequency),]
head(sorted)

sum <- sum(locationresults$frequency)
percent <- sum/rowcount * 100

print(   sprintf("The total frequency of the floor/ flooring materials injuries is %i, or %.2f%%.", sum, percent ))
[1] "The total frequency of the floor/ flooring materials injuries is 29348, or 25.49%."

As we combine the various data columns to our data exploration, we are seeing distinct groupings of the factors that make up this dataset.

Tieing Together the Data Exploration.

As previously observed, females at home, suffer internal injuries to their heads when landing on the floors. Let’s look at the revised dataset when we look at the most frequently sustain injuries.

sunburst_data <- data.frame(
  ids = c("Total", paste0(locationresults$sex, "-", locationresults$location), paste0(locationresults$sex, "-", locationresults$location, "-", locationresults$diagnosis), paste0(locationresults$sex, "-", locationresults$location, "-", locationresults$diagnosis, "-", locationresults$body_part)),
  labels = c("Total", paste0(locationresults$sex, "\n", locationresults$location), paste0(locationresults$diagnosis), paste0(locationresults$body_part)),
  parents = c("", rep("Total", 6), paste0(locationresults$sex, "-", locationresults$location), paste0(locationresults$sex, "-", locationresults$location, "-", locationresults$diagnosis)),
  values = c(sum(data$frequency), rep(NA, 6), locationresults$frequency),
  stringsAsFactors = FALSE
)
Error in data$frequency : object of type 'closure' is not subsettable

Body Part 2, Products 2, and Product 3

I’m reluctant to delve too deeply into these columns. First, these 3 columns are mostly blank, so where there is data in these columns, it seems to be a secondary result of the body_part and product_1. In other words, if we come up with a way to prevent injury to body_part with product_1, then we can prevent problems with body_part_2 and product_2 and product_3.

Addition Feature Engineering

library(waffle)

results <- sqldf('SELECT age, sex, count(sex) AS "frequency"
                        FROM df_mapped
                        GROUP BY age, sex')

ggplot(data5) + 
geom_point(mapping = aes(x=diagnosis, y=frequency, color=race, alpha=race, shape=sex)) + 
    labs(title = "Frequency of Injury By Diagnosis, Gender, & race <100", x="Diagnosis", y="Count")

Data Modeling with Hiearchical Model

Using ChatGPT for Analysis of the Narrative Data

---
title: "Emergency Room Falling Injury Analysis"
output: html_notebook
---

#Introduction
This analysis was done by Mike Goldweber, Sept 2023. Submitted to DrivenData.org on Oct 6, 2023. This document shows;
the step by step process of analyzing the emergency room data provided in the Unsupervised Wisdom contest.

The full set of project files can be found at: https://github.com/halciber/Machine-Learning/tree/master/UnsupervisedWisdom.

# Data Ingestion
The data used in this project is the Primary data set. This contains 115,128 rows of data. The data is pulled into two dataframes. One called df_original, and the other is called df_mapped. The df_mapped set is modified by the example code so that it produces human readable data. the df_original set will be modified by the feature engineer to be usable by the modeling done below.



```{r}
#You'll want to adjust your file path for your environment
filepath <- "c:\\_working\\Machine-Learning\\UnsupervisedWisdom\\Data\\primary_data.csv"

df_original <- read.csv(filepath)

#we'll use this variable frequently
rowcount <- nrow(df_mapped)
```

## Setting up the categorical mapping dataframe
The code for the next two blocks was take directly from the Community Code section of the contest website, found at: https://www.drivendata.org/competitions/217/cdc-fall-narratives/community-code/.The purpose of this block is to take the JSON coding information and match it to the numeric values found in primary_data.csv. I've found this version of the dataset to be helpful in understanding the data. For example, "57 - FRACTURE" is more meaningful than the original "57". My intention is to use the df_mapped dataframe primarily with the data exploration and the. df_original, created in the code block above will be used for the modeling work after modification in the Feature Engineer section below.

```{r}
library(jsonlite)

mappingfile <- 'c:\\_working\\Machine-Learning\\UnsupervisedWisdom\\Data\\variable_mapping.json'
mapping <- fromJSON(mappingfile)
names(mapping)

# Convert to data frames so we can use in joins
mapping_tables <- list()
for (col in names(mapping)) {
    mapping_tables[[col]] <- data.frame(
        ind=as.integer(names(mapping[[col]])),  # change to integer types
        values=unlist(mapping[[col]])
    )
}
```

Now that the JSON information has been ingested, we'll map the categories on to the dataframe, which we will call **df_mapped**.
```{r}
library(dplyr)

# Load primary data
df_mapped <- df_primary

# Join and replace encoded column
for (col in names(mapping)) {
    df_mapped <- df_mapped %>%
        left_join(mapping_tables[[col]], by=setNames("ind", col)) %>%
        mutate(!!col := values) %>%
        select(-values)
}


```





# Data Clean Up/ Preliminary Feature Engineer
Let's look at the dataset to get an initial feel for the data quality by running a summary

```{r}
summary(df_original)
```
Usually a concern is missing data scattered randomly thoughtout the set. In this case, the summary shows the data is in good shape. That is there isn't much in the way of missing data. The only NA's we see are in the diagnosis_2 and body_part_2 columns. This isn't a suprise, because the ER patients didn't necessarily suffer secondary injuries. There is a correlation between these two columns, given the identical number of NAs at 71,983. These columns will have to be explored further to determine if it should be used in our modeling.

However, glancing at the data sets directly shows gaps in some of the other columns. For example body_part_2, other_diagnosis and other_diagnosis_2 contain many gaps.

```{r}
#scroll to the right to see the empty columns (example body_part_2, other_diagnosis and other_disagnosis_2).
head(df_mapped)
```


# Data Exploration
This is actually the critical portion of the project, and most of the effort was spent on this work in order to understand the scope of the problem. The results of this exploration affected later portions of the exploration. Please note, the narrative column won't be looked at in the data exploration. The goal is to use ChatGPT to analyze  this data and return helpful results. So, this part of the data is being handled separately.

Let's begin by looking at correlations between the columns.

## Correlation Plot

```{r}
library(corrplot)

#we need to use numerical values for the correlation. So, we'll make a subset of the data.
df_numsubset<- df_original[, c(4:6, 8:9, 13, 15:22)]

#visualization matrix of the data, looking for correlations
corrplot(cor(df_numsubset), type= "upper")
```
We see strong correlations between race and hispanic, product_1 and product_2, as well as product_2 and product_3. Honestly, this doesn't seem to be very helpful. At least as this stage. Our focus is on the injuries. There are some connections to age and some of the other factors. Including alcohol. So, we'll have to explore this.

## Age
Let's look at the age category. In particular, let's see if there is a particular age that is hit harder than another age. The block converts it into a table, and the plot shows the frequency of injuries by age.
```{r}
library(ggplot2)


#frequency of injuries by age
data <- df_numsubset[, c('age', 'sex')]
df_agefreq <- as.data.frame(table(data$age))
colnames(df_agefreq)[1] <- "age" 
colnames(df_agefreq)[2] <- "frequency" 
#head(df_agefreq)
ggplot(df_agefreq) + 
    geom_count(mapping = aes(x=age, y=frequency)) + 
    labs(title = "Frequency of Injury By Age", x="Age", y="Count")
```
What is interesting about this plot, it shows that the injury frequency is relatively high (over 3500) until age 88. I am wondering if the fall off is due to some environmental movement, or is the population over 88 shrinking? There is a bit of a plateau for ages 71-80, where each group has over 4000 injuries, except for age 76 with 3993 injuries. 

## Sex (Gender)
Next, let's look at the breakdown of sex (gender) in this dataset.

```{r}
#63% of the injuries are to women

genderlabels <- c("Male", "Female")
gender <- as.vector(df_original[ ,'sex'])
gentable <- as.data.frame(table(gender))

gentable$gender <- mapvalues(gentable$gender, c( "1", "2"), genderlabels)

pie(gentable$Freq, labels = genderlabels, main="Sex Breakdown")

percentoffemales <- (gentable[2, 'Freq']/rowcount*100)
out <- sprintf("Percentage of females in this dataset: %f)", percentoffemales) #percentage of females in this dataset
out

```
This plot visually shows the breakdown of sex in this dataset. We see that at **63.1%**, females represents the majority of cases in this dataset. So, we'll pay particular attention to the factors affecting this part of the population.


```{r, include=FALSE}
#cleanup
rm(gender, gentable, genderlables, out)
```





### Age and Gender
Next, let's look at the split by age and gender.

```{r}
library(ggplot2)
library(sqldf)
library(reshape2)

results <- sqldf('SELECT age, sex, count(sex) AS "frequency"
                        FROM df_mapped
                        GROUP BY age, sex')

# reshape the dataframe into a long format
results <- melt (results, id.vars = c ("sex", "age"))

# plot two lines for different genders
ggplot (results, aes (x = age, y = value, group = sex, color = sex)) +
  geom_line () +
  labs (x = "Age", y = "Injury Frequency", color = "Sex")
```

```{r, include=FALSE}  
rm(results)
```


This chart shows a significant difference by gender across the ages of the patients. First, this plot confirms the previous pie chart plot by showing the female population suffers greater numbers of injuries *across the age spectrum*. As we explore further, we'll have to consider the possibility of using different approaches for helping each gender. Of note, this shows the data set only includes males and females. The other categories are not represented in this dataset.


## Data Exploration by Race and Sex
Next, let's factor in the race of the patients to see if any particular group is affected more than the others. After looking at the upper level data by race, we'll look at the hispanic breakdown at the population.


### Breakdown By Race
We'll start off with a simple pie chart to see.


```{r}

if (system.file(package = "waffle") == "") {
  install.packages("waffle")
}

library(ggplot2)
library(waffle)

rcounts <- as.data.frame(table(df_mapped$race))
colnames(rcounts)[1]<-"race"
colnames(rcounts)[2]<-"frequency"

vec <- numeric()
vecS <- character()

for(i in rcounts$frequency)
{
    x <- i/rowcount
    x <- x*100
    vec <- c(vec, x)
    
    s <- ifelse((x > 7.0), sprintf("%.2f%%", x), "")#display only meaning values on the pie chart
    vecS <- c(vecS, s)
}

rcounts <- cbind(rcounts, new_col = vec)
colnames(rcounts)[3]<-"percent"

rcounts <- cbind(rcounts, new_col = vecS)
colnames(rcounts)[4]<-"labels"

ggplot(rcounts, aes(x = "", y = percent, fill = race)) +
  geom_bar(stat="identity", width=1) +
  geom_text(aes(label = labels), position = position_stack(vjust = 0.5)) +
  scale_fill_manual(values = c("#E59866", "#FCF3CF", "#AF601A", "#AED6F1", "#BD0026", "#27AE60", "#FAD7AD")) + ##E59866
  coord_polar("y", start=0) + 
  labs(title = "Breakdown of the Patient Population by Race",
          #subtitle = "test",
          caption = "34.4% did not state their race, making a racial correlation difficult")
```    

```{r, include=FALSE}  
rm(vec, vecS, x, s, rcounts)
```
This chart is problematic, because of the high percentage of N.S. (not stated). So it maybe difficult to accurately identify a racial component to these injuries. 

### Hispanic


```{r}
if (system.file(package = "waffle") == "") {
  install.packages("waffle")
}

library(ggplot2)
library(waffle)

hcounts <- as.data.frame(table(df_mapped$hispanic))
colnames(hcounts)[1]<-"hispanic"
colnames(hcounts)[2]<-"frequency"

vec <- numeric()
vecS <- character()

for(i in hcounts$frequency)
{
    x <- i/rowcount
    x <- x*100
    vec <- c(vec, x)
    
    s <- sprintf("%.2f%%", x)
    vecS <- c(vecS, s)
}

hcounts <- cbind(hcounts, new_col = vec)
colnames(hcounts)[3]<-"percent"

hcounts <- cbind(hcounts, new_col = vecS)
colnames(hcounts)[4]<-"labels"

ggplot(hcounts, aes(x = "", y = percent, fill = hispanic)) +
  geom_bar(stat="identity", width=3) +
  geom_text(aes(label = labels), position = position_stack(vjust = 0.5)) +
  scale_fill_manual(values = c("#E59866","#27AE60", "#FCF3CF")) +   
  coord_polar("y", start=0) + 
  labs(title = "Breakdown of the Hispanic Population",
          #subtitle = "test",
          caption = "Only 3% of the patients identify as Hispanic")
```
So, the majority of patients are not hispanic, however, we can't conclude not being hispanic makes a person more likey to fall.
```{r, include=FALSE}
#cleanup
rm(vec, vecS, x, s, hcounts)
```


```{r}
if (system.file(package = "moonBook") == "") {
  install.packages("moonBook")
}

if (system.file(package = "webr") == "") {
  install.packages("webr")
}

library(moonBook)
library(ggplot2)
library(webr)
library(dplyr)

rhcounts <- as.data.frame(table(df_mapped$hispanic, df_mapped$race))
colnames(rhcounts)[1]<-"hispanic"
colnames(rhcounts)[2]<-"race"
colnames(rhcounts)[3]<-"frequency"

webr::PieDonut(rhcounts, aes(hispanic, race, count=frequency), 
              r0 = 0.3, #hides the center
               r1 = 0.7,
               #explode = 3,
               #explodeDonut = TRUE,
               pieLabelSize = 3,
               donutLabelSize = 4,
               title = "Hispanic and Racial Breakdown")

```
As with the Hispanic pie chart, and with only 3% of the injured set being hispanic, I'm not convinced that hispanic=yes correlates to a fall. Further, no one racial group stands out Certainly, part of the Unk/Not Stated group is probably hispanic, but we don't know for certain. Further obfuscating this part of the hispanic set is the N.S part of the race data, which represents 76% of the Unknown hispanic group. 


We'll continue to look at race, but I don't believe there is anything to be gained from including the hispanic column in our model.



```{r, ignore=TRUE}
rm(rhcounts)
```



## Diagnosis
Next is a look at the diagnosis column. We'll start off by looking for the most heavily diagnosed injuries.

```{r}
#grouping the diagnosis column, looking for the most frequent injury

##frequency of each diagnosis
dcounts <-df_mapped[, c('diagnosis')]

df_diagnosisfreq <- as.data.frame(table(dcounts))
colnames(df_diagnosisfreq)[1] <- "diagnosis_code" 
colnames(df_diagnosisfreq)[2] <- "frequency" 

#sort the data in descending order
df_diagnosisfreqSorted <- df_diagnosisfreq[order(-df_diagnosisfreq$frequency),]
head(df_diagnosisfreqSorted)

#cleanup
rm(dcounts, df_diagnosisfreq)
```
From this sorting, we know that fractures, internal injuries, contusions abj, and lacertations make up the majority of the 26 listed injuries by a significant amount.

```{r}
#based on the frequency of this plot, we can see that the first 4 items represent the majority of the diagnosis
#lets look at the percentage of the cases

highest4diagnosis <- sum(df_diagnosisfreqSorted$frequency[1:4]) # This equals 99,868
percentoftop4diagnosis <- highest4diagnosis/rowcount    #115128=total injuries 86.7%
print(sprintf("Percentage of the top 4 injuries is: %f%%, or %i total reported injuries.", percentoftop4diagnosis*100, highest4diagnosis))


library(ggplot2)
library(waffle)


rm(vec, vecs)

vec <- numeric()
vecS <- character()
vecColor <- character()

n <- nrow(df_diagnosisfreqSorted)

for(i in 1:n)
{
    row <- df_diagnosisfreqSorted[i, ]
    
    code <- as.character(row$diagnosis_code)
    freq <- as.integer(row$frequency)
    
    print(c(code, freq))
    
    perc <- (freq/rowcount)*100
    
    vec <- c(vec, perc) #add value to a vector
    sDiag <- ifelse((perc > 10.0), sprintf("%s", code), "")#display only meaning values on the pie chart
    vecS <- c(vecS, sDiag)
    
    sColor <- ifelse((i < 5), "#AA0000", "#0000AA")
    vecColor <- c(vecColor, sColor)                     
}#end of apply


df_diagnosisfreqSorted <- cbind(df_diagnosisfreqSorted, new_col = vec)
colnames(df_diagnosisfreqSorted)[3]<-"percent"

df_diagnosisfreqSorted <- cbind(df_diagnosisfreqSorted, new_col = vecS)
colnames(df_diagnosisfreqSorted)[4]<-"labels"

df_diagnosisfreqSorted <- cbind(df_diagnosisfreqSorted, new_col = vecColor)
colnames(df_diagnosisfreqSorted)[5]<-"colors"

ggplot(df_diagnosisfreqSorted, aes(x = "", y = frequency, fill = diagnosis_code)) +
  geom_bar(stat="identity", width=1) +
  geom_text(aes(label = labels), position = position_stack(vjust = 0.5)) +
  #scale_fill_manual(values = df_diagnosisfreqSorted$colors) +
  coord_polar("y", start=0)

```
Seeing the majority (86%) of the dataset is diagnosed with these 4 injuries (fractures, internal injuries, contusions abj, and lacertations), our focus will be on how age, race and location play a part with these injuries.

```{r, ignore=TRUE}
#cleanup
rm(vec, vecS, vecColor, df_diagnosisfreqSorted)
```
### Fire Involvement and Alcohol
Related to the injuries are the cases of burns and alcohol having an effect on the falling injuries. Let's take a look to see how the patient population in this dataset has been affected by them.We know from the diagnosis frequencies, that burn injuries 

#### Fire Involvement
```{r}
##frequency of the burn diagnosis
burncounts <- sqldf('SELECT diagnosis, count(diagnosis) AS "frequency"
                        FROM df_mapped
                        WHERE (diagnosis = "47 - BURN, NOT SPEC." or diagnosis = "49 - BURN, CHEMICAL" or diagnosis = "48 - BURN, SCALD" or diagnosis = "51 - BURNS, THERMAL")
                        GROUP BY diagnosis
                        ORDER BY frequency DESC')

head(burncounts)


##frequency of fire_involvement
fcounts <-df_mapped[, c('fire_involvement')]

df_firefreq <- as.data.frame(table(fcounts))
colnames(df_firefreq)[1] <- "fire_involvement" 
colnames(df_firefreq)[2] <- "frequency" 

#sort the data in descending order
df_FireIsFreqSorted <- df_firefreq[order(-df_firefreq$frequency),]
head(df_FireIsFreqSorted)

#cleanup
rm(burncounts, fcounts, df_firefreq, df_FireIsFreqSorted)
```
From these numbers, we see there is very few burn injuries. Eightyone total, or 0.7% of the total dataset. For this project, we won't explore burn injuries or fire involvement further because this is such a small part of the population.

#### Alcohol
Unlike the burns and fire involvement, there isn't a set of injuries that specifically connects with alcohol usage. So, if we see a large number of patients that reported alcohol usage, it will be interesting to see which diagnosis were tied to it.

```{r}
##frequency of the burn diagnosis
alcoholcounts <- sqldf('SELECT alcohol, count(alcohol) AS "frequency"
                        FROM df_mapped
                        GROUP BY alcohol
                        ORDER BY frequency DESC')

head(alcoholcounts)

yestotal <- alcoholcounts[2,2]

percent <- yestotal/rowcount*100
print(sprintf("Percentage of cases where alcohol was related to the falling injury: %.2f%%.", percent))
#cleanup
rm(alcoholcounts, yestotal, percent)
```
The number of alcohol related cases is 2588, or 2.25% of the patients in the dataset. For this project, we won't explore alcohol further because this represents a small part of the population.

## Locations Where Patients Reported Getting Hurt
I want to take a quick look at where people are getting hurt. This could impact how we look for correlations later on.

```{r}

locationresults <- sqldf('SELECT location, count(location) AS "frequency"
                        FROM df_mapped
                        WHERE (diagnosis = "53 - CONTUSIONS, ABR." or diagnosis = "57 - FRACTURE" or diagnosis = "59 - LACERATION" or diagnosis = "62 - INTERNAL INJURY")
                        GROUP BY location')

pie(locationresults$frequency, labels = locationresults$location, main="Location Breakdown")

#percentoffemales <- (gentable[2, 'Freq']/rowcount*100)
#ut <- sprintf("Percentage of females in this dataset: %f)", percentoffemales) #percentage of females in this dataset
#out
```
This pie chart isn't the prettiest, but it does clearly show that Home is the primary location where injuries take place. This is followed up by Public and Unknown. Unknown is a frustrating result for this portion of the data, because it isn't clear how to mitigate problems at this location.


## Diagnosis, Race, and Gender
Let's look to see if there is any correlation between the injuries, race, and gender, with the focus being on the top 4 injuries identified above.

```{r}
if (system.file(package = "sqldf") == "") {
  install.packages("sqldf")
}

library(sqldf)
library(plyr)
library(dplyr)

#frequency of each diagnosis by gender
results <- sqldf('SELECT diagnosis, sex, count(*) AS "frequency"
                        FROM df_mapped
                        WHERE (diagnosis = "53 - CONTUSIONS, ABR." or diagnosis = "57 - FRACTURE" or diagnosis = "59 - LACERATION" or diagnosis = "62 - INTERNAL INJURY")
                        GROUP BY diagnosis, sex
                        ORDER BY diagnosis, sex, frequency DESC')


#plot them
ggplot(results) + 
    geom_count(mapping = aes(x=diagnosis, y=frequency, color=sex)) + 
    labs(title = "Frequency of Injury By Diagnosis & Gender", x="Diagnosis", y="Count")
```
 From this plot, we can see that women are severely diagnosed with "57 - Fractures" and "63 - Internal Injury"; however, women suffer from all of these top diagnosis more than men. **So special attention** has to be paid for addressing the factors affecting women.
 
## Diagnosis, Sex (Gender), and Location 
So far, we've seen individual breakdowns of the data, but let's start mixing up the columns in the hopes of exposing a culprete to the problem.

```{r}
locationresults <- sqldf('SELECT location, diagnosis, sex, count(sex) AS "frequency"
                        FROM df_mapped
                        WHERE (diagnosis = "53 - CONTUSIONS, ABR." or diagnosis = "57 - FRACTURE" or diagnosis = "59 - LACERATION" or diagnosis = "62 - INTERNAL INJURY")
                        GROUP BY location, diagnosis, sex')
sum(locationresults$frequency) #results = 99868
    #<-    df_firefreq[order(-df_firefreq$frequency),]
sorted <- locationresults[order(-locationresults$frequency),]

head(sorted)

rm(locationresults, sorted)

```
This result is interesting because we see that fractures and internal injuries affect women by a significant amount **at home**.


# Location, Diagnosis, Body Part
Now that we've identified females are being disproportional hurt at home, which body_part is affected the most? 

I'm focused on the women, however, let's continue to look at men in the query.
```{r}
locationresults <- sqldf('SELECT location, diagnosis, body_part, sex, count(sex) AS "frequency"
                        FROM df_mapped
                        WHERE (diagnosis = "53 - CONTUSIONS, ABR." or diagnosis = "57 - FRACTURE" or diagnosis = "59 - LACERATION" or diagnosis = "62 - INTERNAL INJURY")
                        GROUP BY location, diagnosis, body_part,  sex')
sum(locationresults$frequency) #results = 99868
    #<-    df_firefreq[order(-df_firefreq$frequency),]
sorted <- locationresults[order(-locationresults$frequency),]

head(sorted)

```
These results are helpful. Previously we saw fractures at the most serious problem. However, when we factor in the body_part we see the internal injuries to the head at home and in the public location is the most serious problem

```{r, ignore=TRUE}
rm(locationresults, sorted)
```

Let's look at this same data with just the women.
```{r}

locationresults <- sqldf('SELECT location, diagnosis, body_part, sex, count(sex) AS "frequency"
                        FROM df_mapped
                        WHERE (diagnosis = "53 - CONTUSIONS, ABR." or diagnosis = "57 - FRACTURE" or diagnosis = "59 - LACERATION" or diagnosis = "62 - INTERNAL INJURY") AND
                                (sex = "FEMALE")
                        GROUP BY location, diagnosis, body_part,  sex')
sum(locationresults$frequency) #results = 99868
    #<-    df_firefreq[order(-df_firefreq$frequency),]
sorted <- locationresults[order(-locationresults$frequency),]

head(sorted)

```
From this table, we see that when the diagnosis is an internal injury, the head is the body part injuried the most. Home is the most likely place for the injuries, followed by the public or unknown locations. Next most frequent injury is the upper or lower trunk experiencing fractures.

Let do a similar query for the males to see how they are affected.
```{r}

locationresults <- sqldf('SELECT location, diagnosis, body_part, sex, count(sex) AS "frequency"
                        FROM df_mapped
                        WHERE (diagnosis = "53 - CONTUSIONS, ABR." or diagnosis = "57 - FRACTURE" or diagnosis = "59 - LACERATION" or diagnosis = "62 - INTERNAL INJURY") AND
                                (sex = "MALE")
                        GROUP BY location, diagnosis, body_part,  sex')
sum(locationresults$frequency) #results = 99868
    #<-    df_firefreq[order(-df_firefreq$frequency),]
sorted <- locationresults[order(-locationresults$frequency),]

head(sorted)

```
The results for men are very similar in that internal injuries to the head are hit the male population with the greatest frequency. Also, like the females, we see that males experience the fractures to the lower and upper trunk most frequently (after the head injuries), and the major of these injuries are at home.

## Connections to Products?
Next, let's see if there is a tie to the products.

```{r}

locationresults <- sqldf('SELECT location, diagnosis, body_part, product_1,  sex, count(sex) AS "frequency"
                        FROM df_mapped
                        WHERE (diagnosis = "53 - CONTUSIONS, ABR." or diagnosis = "57 - FRACTURE" or diagnosis = "59 - LACERATION" or diagnosis = "62 - INTERNAL INJURY")
                        GROUP BY location, diagnosis, body_part, product_1,  sex')
#sum(locationresults$frequency) #results = 99868
    #<-    df_firefreq[order(-df_firefreq$frequency),]
sorted <- locationresults[order(-locationresults$frequency),]

head(sorted)

```
This query shows the majority of products are tied to floors or flooring materials. Does this mean the patient just fell on the floor leading to the internal injuries and fractures? Also of note home, internal injury, and heads are the most likely to be the combination for our patients. Let's look at the same data with just the floors.

```{r}

locationresults <- sqldf('SELECT location, diagnosis, body_part, product_1,  count(product_1) AS "frequency"
                        FROM df_mapped
                        WHERE (diagnosis = "53 - CONTUSIONS, ABR." or diagnosis = "57 - FRACTURE" or diagnosis = "59 - LACERATION" or diagnosis = "62 - INTERNAL INJURY")
                                AND (product_1 = "1807 - FLOORS OR FLOORING MATERIALS")
                        GROUP BY location, diagnosis, body_part, product_1')
sorted <- locationresults[order(-locationresults$frequency),]
head(sorted)

sum <- sum(locationresults$frequency)
percent <- sum/rowcount * 100

print(   sprintf("The total frequency of the floor/ flooring materials injuries is %i, or %.2f%%.", sum, percent ))

```
As we combine the various data columns to our data exploration, we are seeing distinct groupings of the factors that make up this dataset.


## Tieing Together the Data Exploration.
As previously observed, females at home, suffer internal injuries to their heads when landing on the floors. Let's look at the revised dataset when we look at the most frequently sustain injuries.

```{r}
if (system.file(package = "circlepacketR") == "") {
  install.packages("circlepacketR")
}

library(sqldf)
library(circlepackeR) 
#
locationresults <- sqldf('SELECT sex, location, diagnosis, body_part, count(*) AS "frequency"
                        FROM df_mapped
                        WHERE (diagnosis = "53 - CONTUSIONS, ABR." or diagnosis = "57 - FRACTURE" or diagnosis = "59 - LACERATION" or diagnosis = "62 - INTERNAL INJURY")
                              
                        GROUP BY sex, location, diagnosis, body_part
                        ORDER BY frequency DESC')

rowct <- nrow(locationresults)

data <- data.frame(
  root=rep("root", locationrows),
  group=c(rep(locationresults$sex,rowct),, ), 
  subgroup=  rep(locationresults$location,rowct),
  subsubgroup=rep(rep(locationresults$diagnosis,rowct)),
  value=sample(seq(1:15), )
)





```



## Body Part 2, Products 2, and Product 3
I'm reluctant to delve too deeply into these columns. First, these 3 columns are mostly blank, so where there is data in these columns, it seems to be a secondary result of the body_part and product_1. In other words, if we come up with a way to prevent injury to body_part with product_1, then we can prevent problems with body_part_2 and product_2 and product_3.


# Addition Feature Engineering

```{r}
library(waffle)

results <- sqldf('SELECT age, sex, count(sex) AS "frequency"
                        FROM df_mapped
                        GROUP BY age, sex')

ggplot(data5) + 
geom_point(mapping = aes(x=diagnosis, y=frequency, color=race, alpha=race, shape=sex)) + 
    labs(title = "Frequency of Injury By Diagnosis, Gender, & race <100", x="Diagnosis", y="Count")
```

# Data Modeling with Hiearchical Model

```{r}

```

# Using ChatGPT for Analysis of the Narrative Data

```{r}

```

